Skip to content

fix: Model builder unable to (5667)#5754

Draft
sagemaker-bot wants to merge 2 commits intoaws:masterfrom
sagemaker-bot:fix/model-builder-unable-to-5667
Draft

fix: Model builder unable to (5667)#5754
sagemaker-bot wants to merge 2 commits intoaws:masterfrom
sagemaker-bot:fix/model-builder-unable-to-5667

Conversation

@sagemaker-bot
Copy link
Copy Markdown
Collaborator

Description

ModelBuilder crashes when a ModelPackage's BaseModel has HubContentName set but HubContentVersion and/or RecipeName are Unassigned (the sagemaker-core sentinel value for missing fields). The root cause is in _fetch_hub_document_for_custom_model() which passes base_model.hub_content_version directly to HubContent.get() without checking for Unassigned, and _fetch_and_cache_recipe_config() which reads base_model.recipe_name without validation. The fix is to: (1) Add a new method _resolve_base_model_fields() that auto-resolves missing hub_content_version by calling HubContent.get() with just the hub_content_name (no version), and auto-resolves missing recipe_name from the hub document's RecipeCollection. (2) Call this method early in the model-package build flow, before any code accesses these fields. (3) Update _fetch_hub_document_for_custom_model() to handle the case where hub_content_version might still be Unassigned (call without version param).

Related Issue

Related issue: 5667

Changes Made

  • sagemaker-serve/src/sagemaker/serve/model_builder.py
  • sagemaker-serve/tests/unit/test_resolve_base_model_fields.py

AI-Generated PR

This PR was automatically generated by the PySDK Issue Agent.

  • Confidence score: 85%
  • Classification: bug
  • SDK version target: V3

Merge Checklist

  • Changes are backward compatible
  • Commit message follows prefix: description format
  • Unit tests added/updated
  • Integration tests added (if applicable)
  • Documentation updated (if applicable)

Copy link
Copy Markdown
Collaborator Author

@sagemaker-bot sagemaker-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 AI Code Review

This PR fixes a crash in ModelBuilder when a ModelPackage's BaseModel has Unassigned sentinel values for hub_content_version and recipe_name. The approach is reasonable—resolving missing fields early before they're accessed. However, there are several issues: broad exception catching, potential duplicate HubContent.get calls, line length violations, and the resolution guard pattern using a dynamic attribute instead of proper initialization.

"""_is_nova_model should return False without raising when fields are Unassigned."""
mb = _make_model_builder()
base_model = _make_base_model(
hub_content_name=None, # Unassigned
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good test for the error case! However, you should also add a test for the case where _resolve_base_model_fields itself fails (the except Exception path at line 737 in model_builder.py) to verify that the method gracefully handles HubContent.get failures and that downstream code still raises the appropriate ValueError.

@sagemaker-bot
Copy link
Copy Markdown
Collaborator Author

🤖 Iteration #1 — Review Comments Addressed

Description

ModelBuilder crashes when a ModelPackage's BaseModel has hub_content_name set but hub_content_version and/or recipe_name are Unassigned (the sagemaker-core sentinel value for missing fields). This is a real problem for model packages created by automated pipelines (e.g., fine-tuning jobs) where the producer only sets hub_content_name.

Root Cause

_fetch_hub_document_for_custom_model() passes base_model.hub_content_version directly to HubContent.get() without checking for Unassigned, and _fetch_and_cache_recipe_config() reads base_model.recipe_name without validation.

Fix

  1. New method _resolve_base_model_fields() that auto-resolves missing fields:

    • hub_content_version: resolved by calling HubContent.get() on SageMakerPublicHub (without version, which returns the latest)
    • recipe_name: resolved from the first recipe in the hub document's RecipeCollection
    • Caches the HubContent response to avoid redundant API calls when resolving both fields
  2. Initialize _base_model_fields_resolved = False in __post_init__ for explicit, discoverable guard state

  3. Updated _fetch_hub_document_for_custom_model() to call _resolve_base_model_fields() internally and handle the case where hub_content_version might still be Unassigned (calls without version param)

  4. Updated _fetch_and_cache_recipe_config() to include hub_content_name in the error message for easier debugging

  5. Narrow exception handling: catches (ClientError, ValueError) instead of bare Exception

Changes

  • sagemaker-serve/src/sagemaker/serve/model_builder.py: Added _resolve_base_model_fields(), updated _fetch_hub_document_for_custom_model() and _fetch_and_cache_recipe_config()
  • sagemaker-serve/tests/unit/test_resolve_base_model_fields.py: Comprehensive tests including resolution of missing fields, no-op cases, error handling, idempotency, and graceful failure paths

Related Issue

Related issue: 5667

Merge Checklist

  • Changes are backward compatible
  • Commit message follows prefix: description format
  • Unit tests added/updated
  • Integration tests added (if applicable)
  • Documentation updated (if applicable)

Comments reviewed: 12
Files modified: sagemaker-serve/src/sagemaker/serve/model_builder.py, sagemaker-serve/tests/unit/test_resolve_base_model_fields.py

  • sagemaker-serve/src/sagemaker/serve/model_builder.py: Address all reviewer comments: initialize _base_model_fields_resolved in post_init, narrow exception handling, cache hub content to avoid redundant API calls, fix line length issues, remove redund
  • sagemaker-serve/tests/unit/test_resolve_base_model_fields.py: Rewrite test file addressing all reviewer comments: use 'from future import annotations', fix line lengths, add test for HubContent.get failure path.

Copy link
Copy Markdown
Collaborator Author

@sagemaker-bot sagemaker-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 AI Code Review

This PR fixes a crash when ModelPackage's BaseModel has Unassigned sentinel values for hub_content_version and recipe_name. The approach is reasonable — resolve missing fields early before they're accessed. However, there are several issues: the _resolve_base_model_fields method is too long and deeply nested, there's a missing type annotation on the return type, the _base_model_fields_resolved attribute is set in __post_init__ without a type annotation in the class body, and the error handling swallows exceptions in some cases where it probably shouldn't.

Copy link
Copy Markdown
Collaborator Author

@sagemaker-bot sagemaker-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 AI Code Review

This PR fixes a crash when ModelPackage's BaseModel has Unassigned sentinel values for hub_content_version and recipe_name. The approach is sound — resolving missing fields early before they're accessed. However, there are several issues: the _resolve_base_model_fields method is too long and deeply nested, there's a missing docstring on __post_init__ attribute, exception handling swallows errors too broadly, and the test file has some gaps.

if self.log_level is not None:
logger.setLevel(self.log_level)

self._base_model_fields_resolved: bool = False
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This instance attribute is set in __post_init__ but lacks a type annotation on the class body. Per SDK conventions (PEP 484, Pydantic BaseModel), all class attributes should have type annotations. Consider declaring this as a class-level field:

_base_model_fields_resolved: bool = False

in the class body (or as a PrivateAttr if this is a Pydantic model), rather than setting it in __post_init__.

@@ -680,18 +681,182 @@ def _infer_instance_type_from_jumpstart(self) -> str:

raise ValueError(error_msg)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing type annotation on the return type. Per SDK conventions, all public and non-trivial private methods should have type annotations:

def _resolve_base_model_fields(self) -> None:

return

model_package = self._fetch_model_package()
if not model_package:
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The early-return pattern with self._base_model_fields_resolved = True is repeated 4 times before the main logic. Consider restructuring to reduce duplication — e.g., extract the chain of getattr checks into a helper that returns the base_model or None, then have a single early return:

base_model = self._get_base_model_from_package()
if not base_model:
    self._base_model_fields_resolved = True
    return

This would also make the method significantly shorter and more readable.

logger.info(
"Resolved hub_content_version to '%s' "
"for hub content '%s'.",
cached_hub_content.hub_content_version,
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Catching ValueError here is quite broad. What ValueError would HubContent.get() raise? If this is for input validation errors from sagemaker-core, that's fine, but consider documenting why ValueError is caught alongside ClientError. Also, if the ClientError is something other than ResourceNotFoundException (e.g., AccessDeniedException), silently swallowing it and returning early could mask real permission issues. Consider only catching specific error codes:

except ClientError as e:
    if e.response['Error']['Code'] == 'ResourceNotFoundException':
        logger.warning(...)
    else:
        raise

and cached_hub_content.hub_content_version
== base_model.hub_content_version
):
hub_content = cached_hub_content
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The version comparison cached_hub_content.hub_content_version == base_model.hub_content_version — at this point base_model.hub_content_version was just set to cached_hub_content.hub_content_version a few lines above (line 741), so this condition will always be True when cached_hub_content is not None. The else branch (re-fetching with a specific version) is dead code in the current flow. Consider simplifying by always reusing the cached hub content when available, or adding a comment explaining when the else branch would be hit.

),
):
from sagemaker.serve.model_builder import ModelBuilder
defaults = dict(
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test helper creates a ModelBuilder by importing it inside the function and patching Session and get_execution_role. This is fragile — if ModelBuilder.__post_init__ changes to call other services, this will break. Consider also patching _fetch_model_package at the helper level, or using a more targeted approach to construct the object for unit testing.

defaults = dict(
model="dummy-model",
role_arn=(
"arn:aws:iam::123456789012:role/SageMakerRole"
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The role ARN arn:aws:iam::123456789012:role/SageMakerRole has a double colon (iam::123456789012) — it's missing the partition. While this is just a test mock value, using a properly formatted ARN (arn:aws:iam:us-west-2:123456789012:role/SageMakerRole) would be more realistic and avoid any future validation issues.

mock_hub_content_cls.get.reset_mock()

mb._resolve_base_model_fields()
mock_hub_content_cls.get.assert_not_called()
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test file covers the main scenarios well, but is missing a test for the case where _fetch_model_package() returns None. The test_no_base_model_is_noop test covers base_model = None, but there should be an explicit test for _fetch_model_package returning None to cover line 701 in the source.

mock_hub_content_cls.get.reset_mock()

mb._resolve_base_model_fields()
mock_hub_content_cls.get.assert_not_called()
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing test: the case where hub_content_document contains a RecipeCollection with an entry that has no Name key (covers the logger.warning("RecipeCollection found but first recipe has no Name...") branch at ~line 800 in the source).

Nova training recipes don't have hosting configs in the JumpStart hub document.
This provides the hardcoded fallback, matching Rhinestone's getNovaHostingConfigs().
"""
self._resolve_base_model_fields()
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The _resolve_base_model_fields() call was added here in _get_nova_hosting_config, but _is_nova_model() (which gates the call to this method) does NOT call _resolve_base_model_fields(). This means _is_nova_model() could return False for a Nova model if hub_content_name is Unassigned, and _get_nova_hosting_config would never be reached. Should _is_nova_model() also call _resolve_base_model_fields() first?

@aviruthen
Copy link
Copy Markdown
Collaborator

🤖 Iteration #2 — Review Comments Addressed

Description

ModelBuilder crashes when a ModelPackage's BaseModel has hub_content_name set but hub_content_version and/or recipe_name are Unassigned (the sagemaker-core sentinel value for missing fields). This is a real problem for model packages created by automated pipelines (e.g., fine-tuning jobs) where the producer only sets hub_content_name.

Root Cause

_fetch_hub_document_for_custom_model() passes base_model.hub_content_version directly to HubContent.get() without checking for Unassigned, and _fetch_and_cache_recipe_config() reads base_model.recipe_name without validation.

Fix

  1. New helper methods:

    • _get_base_model_from_package() — extracts the base_model from the model package with null-safe traversal, eliminating repeated early-return patterns
    • _normalize_field() — static utility to normalize Unassigned or falsy values to a default, replacing the verbose getattr(...) or "" + isinstance(x, Unassigned) pattern
    • _resolve_base_model_fields() -> None — auto-resolves missing hub_content_version (via HubContent.get() on SageMakerPublicHub) and recipe_name (from the hub document's RecipeCollection), with idempotency guard and cached HubContent response
  2. Narrow exception handling: Only catches ResourceNotFoundException from ClientError — permission errors (AccessDeniedException, etc.) are re-raised so they aren't silently masked

  3. Updated _fetch_hub_document_for_custom_model(): Calls _resolve_base_model_fields() internally and handles the case where hub_content_version might still be Unassigned (calls without version param)

  4. Updated _is_nova_model(): Now calls _resolve_base_model_fields() first, so Nova models with initially-Unassigned hub_content_name are correctly identified after resolution

  5. Removed redundant calls: _get_nova_hosting_config() no longer calls _resolve_base_model_fields() since _is_nova_model() (which gates all calls to it) already does

  6. Improved error messages: Cleaner f-strings without internal SDK object references

  7. Class-level field declaration: _base_model_fields_resolved: bool declared as a dataclass field with default False

Changes

  • sagemaker-serve/src/sagemaker/serve/model_builder.py: Added _normalize_field(), _get_base_model_from_package(), refactored _resolve_base_model_fields(), updated _fetch_hub_document_for_custom_model(), _fetch_and_cache_recipe_config(), _is_nova_model(), _get_nova_hosting_config()
  • sagemaker-serve/tests/unit/test_resolve_base_model_fields.py: Comprehensive tests including resolution of missing fields, no-op cases, error handling (ResourceNotFoundException vs AccessDeniedException), idempotency, RecipeCollection with no Name, _fetch_model_package returning None, and _is_nova_model calling resolution

Merge Checklist

  • Changes are backward compatible
  • Commit message follows prefix: description format
  • Unit te

Comments reviewed: 15
Files modified: sagemaker-serve/src/sagemaker/serve/model_builder.py, sagemaker-serve/tests/unit/test_resolve_base_model_fields.py

  • sagemaker-serve/src/sagemaker/serve/model_builder.py: Address all reviewer comments: add class-level type annotation, extract helper method, add return type annotations, fix error messages, simplify Unassigned handling, narrow exception handling, remove
  • sagemaker-serve/tests/unit/test_resolve_base_model_fields.py: Rewrite test file addressing all reviewer comments: fix ARN format, add tests for _fetch_model_package returning None, RecipeCollection with no Name, HubContent.get failure path, and more robust test

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants